Much of the code and examples are copied/modified from
Blueprints for Text Analytics Using Python by Jens Albrecht, Sidharth Ramachandran, and Christian Winkler (O'Reilly, 2021), 978-1-492-07408-3.
%run "/code/source/config/notebook_settings.py"
pd.set_option('display.max_colwidth', None)
from source.library.text_analysis import count_tokens, tf_idf, get_context_from_keyword, \
count_keywords, count_keywords_by, impurity
with Timer("Loading Data"):
    path = 'artifacts/data/processed/reddit.pkl'
    df = pd.read_pickle(path)
2023-02-26 22:48:32 - INFO | Timer Started: Loading Data
2023-02-26 22:48:33 - INFO | Timer Finished: (0.18 seconds)
This section provides a basic exploration of the text and dataset.
df.head(1)
| | id | subreddit | title | post | impurity | post_clean | all_lemmas | partial_lemmas | bi_grams | adjs_verbs | nouns | noun_phrases | entities | post_length | num_tokens | language |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 74qv99 | Honda | J32A3 Block with J35Z2 Crank and Rods? | Hello, <lb><lb>I have my J32A3 egine open, ready to put new rings and all. I would like to know if I can swap the J35Z2 crank and rods from an Accord Sedan v6 3.5. I would like to keep my pistons, just change the rods and crank to have a little more displacement. This is a doable option? there is something I need to know first? This is for my Acura TL 3G 2004 Manual Transmission. I really would like to know if it's possible and if possible what differences are between then J35Z2 and let's say the classic J35A3.<lb><lb>Thanks in advance. | 0.01 | Hello, I have my J32A3 egine open, ready to put new rings and all. I would like to know if I can swap the J35Z2 crank and rods from an Accord Sedan v6 _NUMBER_ . I would like to keep my pistons, just change the rods and crank to have a little more displacement. This is a doable option? there is something I need to know first? This is for my Acura TL 3G _NUMBER_ Manual Transmission. I really would like to know if it's possible and if possible what differences are between then J35Z2 and let's say the classic J35A3. Thanks in advance. | [hello, i, have, my, j32a3, egine, open, ready, to, put, new, ring, and, all, i, would, like, to, know, if, i, can, swap, the, j35z2, crank, and, rod, from, an, accord, sedan, v6, _number_, i, would, like, to, keep, my, piston, just, change, the, rod, and, crank, to, have, a, little, more, displacement, this, be, a, doable, option, there, be, something, i, need, to, know, first, this, be, for, my, acura, tl, 3, g, _number_, manual, transmission, i, really, would, like, to, know, if, it, be, possible, and, if, possible, what, difference, be, between, then, j35z2, and, let, us, say, ...] | [hello, j32a3, egine, open, ready, new, ring, like, know, swap, j35z2, crank, rod, accord, sedan, v6, _number_, like, piston, change, rod, crank, little, displacement, doable, option, need, know, acura, tl, 3, g, _number_, manual, transmission, like, know, possible, possible, difference, j35z2, let, classic, j35a3, thank, advance] | [j32a3-egine, egine-open, new-ring, j35z2-crank, accord-sedan, sedan-v6, v6-_number_, doable-option, acura-tl, tl-3, 3-g, g-_number_, _number_-manual, manual-transmission, classic-j35a3] | [open, ready, new, like, know, swap, like, change, little, doable, need, know, like, know, possible, possible, let, classic] | [j32a3, egine, ring, j35z2, crank, rod, accord, sedan, v6, _number_, piston, rod, crank, displacement, option, acura, tl, g, _number_, manual, transmission, difference, j35z2, j35a3, thank, advance] | [egine-open, new-ring, j35z2-crank, doable-option, classic-j35a3] | [Accord Sedan (PRODUCT), first (ORDINAL), Acura (ORG), 3 (CARDINAL), g _NUMBER_ Manual Transmission (ORG)] | 542 | 46 | English |
hlp.pandas.numeric_summary(df)
| | # of Non-Nulls | # of Nulls | % Nulls | # of Zeros | % Zeros | Mean | St Dev. | Coef of Var | Skewness | Kurtosis | Min | 10% | 25% | 50% | 75% | 90% | Max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| impurity | 5,000 | 0 | 0.0% | 1,023 | 20.0% | 0.0 | 0.0 | 1.0 | 1.9 | 7.8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.2 |
| post_length | 5,000 | 0 | 0.0% | 0 | 0.0% | 679.1 | 452.5 | 0.7 | 2.5 | 9.2 | 256.0 | 307.0 | 381.0 | 538.0 | 812.2 | 1,217.1 | 4,174.0 |
| num_tokens | 5,000 | 0 | 0.0% | 0 | 0.0% | 52.5 | 35.2 | 0.7 | 2.6 | 10.0 | 12.0 | 24.0 | 30.0 | 42.0 | 63.0 | 93.0 | 365.0 |
hlp.pandas.non_numeric_summary(df)
| | # of Non-Nulls | # of Nulls | % Nulls | Most Freq. Value | # of Unique | % Unique |
|---|---|---|---|---|---|---|
| id | 5,000 | 0 | 0.0% | 74qv99 | 5,000 | 100.0% |
| subreddit | 5,000 | 0 | 0.0% | Lexus | 20 | 0.4% |
| title | 5,000 | 0 | 0.0% | Need some advice | 4,995 | 99.9% |
| post | 5,000 | 0 | 0.0% | Hello, <lb><lb>I have my J32A3[...] | 5,000 | 100.0% |
| post_clean | 5,000 | 0 | 0.0% | Hello, I have my J32A3 egine o[...] | 5,000 | 100.0% |
| all_lemmas | 5,000 | 0 | 0.0% | ['hello', 'i', 'have', 'my', '[...] | 5,000 | 100.0% |
| partial_lemmas | 5,000 | 0 | 0.0% | ['hello', 'j32a3', 'egine', 'o[...] | 5,000 | 100.0% |
| bi_grams | 5,000 | 0 | 0.0% | ['j32a3-egine', 'egine-open', [...] | 5,000 | 100.0% |
| adjs_verbs | 5,000 | 0 | 0.0% | ['open', 'ready', 'new', 'like[...] | 5,000 | 100.0% |
| nouns | 5,000 | 0 | 0.0% | ['j32a3', 'egine', 'ring', 'j3[...] | 5,000 | 100.0% |
| noun_phrases | 5,000 | 0 | 0.0% | [] | 4,947 | 98.9% |
| entities | 5,000 | 0 | 0.0% | [] | 4,565 | 91.3% |
| language | 4,996 | 4 | 0.1% | English | 1 | 0.0% |
df['post'].iloc[0][0:1000]
"Hello, <lb><lb>I have my J32A3 egine open, ready to put new rings and all. I would like to know if I can swap the J35Z2 crank and rods from an Accord Sedan v6 3.5. I would like to keep my pistons, just change the rods and crank to have a little more displacement. This is a doable option? there is something I need to know first? This is for my Acura TL 3G 2004 Manual Transmission. I really would like to know if it's possible and if possible what differences are between then J35Z2 and let's say the classic J35A3.<lb><lb>Thanks in advance."
'|'.join(df['partial_lemmas'].iloc[0])[0:1000]
'hello|j32a3|egine|open|ready|new|ring|like|know|swap|j35z2|crank|rod|accord|sedan|v6|_number_|like|piston|change|rod|crank|little|displacement|doable|option|need|know|acura|tl|3|g|_number_|manual|transmission|like|know|possible|possible|difference|j35z2|let|classic|j35a3|thank|advance'
'|'.join(df['bi_grams'].iloc[0])[0:1000]
'j32a3-egine|egine-open|new-ring|j35z2-crank|accord-sedan|sedan-v6|v6-_number_|doable-option|acura-tl|tl-3|3-g|g-_number_|_number_-manual|manual-transmission|classic-j35a3'
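The `bi_grams` column was built in an earlier preprocessing step that is not shown in this notebook. As a rough sketch of how such hyphen-joined pairs can be produced, the snippet below joins adjacent tokens and drops any pair containing a stop word or punctuation mark. The function name `bigrams` and the `STOP_WORDS` set are hypothetical and chosen only for this illustration.

```python
def bigrams(tokens, stop_words, sep="-"):
    """Join adjacent token pairs, skipping any pair that contains a stop word."""
    pairs = zip(tokens, tokens[1:])
    return [sep.join(pair) for pair in pairs if not any(t in stop_words for t in pair)]


# Hypothetical stop-word set; punctuation is included so pairs never span a comma.
STOP_WORDS = {"i", "have", "my", "to", "put", "and", "all", ",", "."}

tokens = ["i", "have", "my", "j32a3", "egine", "open", ",", "ready", "to", "put", "new", "ring"]
print(bigrams(tokens, STOP_WORDS))
```

With these assumed stop words, pairs such as `open-ready` are dropped because a comma sits between the two tokens, which is consistent with the bi-grams shown above.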
'|'.join(df['noun_phrases'].iloc[0])[0:1000]
'egine-open|new-ring|j35z2-crank|doable-option|classic-j35a3'
ax = df['impurity'].plot(kind='box', vert=False, figsize=(10, 1))
ax.set_title("Distribution of Post Impurity")
ax.set_xlabel("Impurity")
ax.set_yticklabels([])
ax;
df[['impurity', 'post', 'post_clean']].sort_values('impurity', ascending=False).head()
| | impurity | post | post_clean |
|---|---|---|---|
| 4684 | 0.18 | I'm looking to lease an a4 premium plus automatic with the nav package.<lb><lb>Vehicle Price:<tab><tab>$49,150.00<tab> <lb> <tab>AutoNation Savings:<tab>-<tab>$3,867.00<tab> <lb> <tab>AutoNation Price:<tab><tab>$45,283.00<tab> <lb> <tab> <tab> <lb> <tab>Sales Tax (estimate):<tab>+<tab>$2,734.98<tab> <lb> <tab>Title Fee:<tab>+<tab>$100.00<tab> <lb> <tab>Tire/Battery/MVWEA:<tab>+<tab>$4.00<tab> <lb> <tab>Tag/Registration Fees (estimate):<tab>+<tab>$207.00<tab> <lb> <tab>Electronic Filing:<tab>+<tab>$20.00<tab> <lb> <tab>Other:<tab>+<tab>$20.00<tab> <lb> <tab>Documentation Fee:<tab>+<tab>$300.00<tab> <lb> <tab>Balance Due (estimate):<tab><tab>$48,668.98<tab> <tab>No Trade-In<lb><lb>LEASE OPTIONS<lb>Cash Due<tab>36 months <tab>42 months <lb><lb>$2,000 <tab>$723<tab>$690<lb>$4,000 <tab>$663<tab>$639<lb>$6,000 <tab>$603<tab>$587<lb><lb><lb>This is my first lease, do these numbers look good? Should I push back or negotiate on anything?<lb><lb>Thanks! | I'm looking to lease an a4 premium plus automatic with the nav package. Vehicle Price: $ _NUMBER_ AutoNation Savings: $ _NUMBER_ AutoNation Price: $ _NUMBER_ Sales Tax (estimate): $ _NUMBER_ Title Fee: $ _NUMBER_ Tire/Battery/MVWEA: $ _NUMBER_ Tag/Registration Fees (estimate): $ _NUMBER_ Electronic Filing: $ _NUMBER_ Other: $ _NUMBER_ Documentation Fee: $ _NUMBER_ Balance Due (estimate): $ _NUMBER_ No Trade-In LEASE OPTIONS Cash Due _NUMBER_ months _NUMBER_ months $ _NUMBER_ $ _NUMBER_ $ _NUMBER_ $ _NUMBER_ $ _NUMBER_ $ _NUMBER_ $ _NUMBER_ $ _NUMBER_ $ _NUMBER_ This is my first lease, do these numbers look good? Should I push back or negotiate on anything? Thanks! |
| 1287 | 0.17 | Bulbs Needed:<lb><lb><lb>**194 LED BULB x8**<lb><lb>4- DOORS<lb><lb>2- MAP LIGHTS<lb><lb>2- VANITY<lb><lb><lb>**3022 LED BULB x3**<lb><lb>2- CARGO DOOR<lb><lb>1- DOME LIGHT<lb><lb><lb>**BULBS USED:**<lb><lb>[194 LED BULBS](https://goo.gl/Jfu2Dx)<lb><lb>[3022 LED BULBS](https://goo.gl/fPgk6n)<lb><lb>[Trim Tools](https://goo.gl/hjxw8Z)<lb><lb>Parts list courtesy of [The Blue TRD](https://www.youtube.com/watch?v=CBJxfWdbEfo&t=28s) from his You Tube Channel.<lb><lb>Just passing along the helpful info. | Bulbs Needed: ** _NUMBER_ LED BULB x8** _NUMBER_ - DOORS _NUMBER_ - MAP LIGHTS _NUMBER_ - VANITY ** _NUMBER_ LED BULB x3** _NUMBER_ - CARGO DOOR _NUMBER_ - DOME LIGHT **BULBS USED:** _NUMBER_ LED BULBS _NUMBER_ LED BULBS Trim Tools Parts list courtesy of The Blue TRD from his You Tube Channel. Just passing along the helpful info. |
| 142 | 0.15 | Breakdown below:<lb><lb>Elantra GT<lb><lb>2.0L 4-cylinder<lb><lb>6-speed Manual Transmission<lb><lb>$19,350<lb><lb>Elantra GT<lb><lb>2.0L 4-cylinder<lb><lb>6-speed Automatic Transmission w/ SHIFTRONIC®<lb><lb>$20,350<lb><lb>Elantra GT Sport<lb><lb>1.6L Turbo GDI 4-cylinder<lb><lb>6-speed Manual Transmission<lb><lb>$23,250<lb><lb>Elantra GT Sport<lb><lb>1.6L Turbo GDI 4-cylinder<lb><lb>7-speed EcoShift® Dual Clutch Transmission w/ SHIFTRONIC®<lb><lb>$24,350 | Breakdown below: Elantra GT _NUMBER_ .0L _NUMBER_ -cylinder _NUMBER_ -speed Manual Transmission $ _NUMBER_ Elantra GT _NUMBER_ .0L _NUMBER_ -cylinder _NUMBER_ -speed Automatic Transmission w/ SHIFTRONIC® $ _NUMBER_ Elantra GT Sport _NUMBER_ .6L Turbo GDI _NUMBER_ -cylinder _NUMBER_ -speed Manual Transmission $ _NUMBER_ Elantra GT Sport _NUMBER_ .6L Turbo GDI _NUMBER_ -cylinder _NUMBER_ -speed EcoShift® Dual Clutch Transmission w/ SHIFTRONIC® $ _NUMBER_ |
| 3174 | 0.13 | E-price:<lb>$20,863.00<lb>Freight:<lb>$900.00<lb>Processing Fee:<lb>$299.00<lb>Total before tax and tag fees:<lb>$22,062.00<lb>7% State SALES TAX: $ 1,544.34<lb>2 YEAR TAG FEES: $187.00<lb>TITLE: $100.00<lb>REGISTRATION: $20.00<lb>LIEN: $20.00<lb>INSPECTION: $25.00 <lb>State TEMP TAG: $20.00<lb>State TIRE FEE: $4.00<lb>TOTAL OUT THE DOOR YOU REQUESTED: $ 23,982.34<lb> | E-price: $ _NUMBER_ Freight: $ _NUMBER_ Processing Fee: $ _NUMBER_ Total before tax and tag fees: $ _NUMBER_ _NUMBER_ % State SALES TAX: $ _NUMBER_ _NUMBER_ YEAR TAG FEES: $ _NUMBER_ TITLE: $ _NUMBER_ REGISTRATION: $ _NUMBER_ LIEN: $ _NUMBER_ INSPECTION: $ _NUMBER_ State TEMP TAG: $ _NUMBER_ State TIRE FEE: $ _NUMBER_ TOTAL OUT THE DOOR YOU REQUESTED: $ _NUMBER_ |
| 3678 | 0.12 | The lease on my 2014 C250 is ending and I have the option to buy it for $20k.<lb>So I decided to see what else is out there for about the same amount of money and so far this are my options:<lb> <lb> <lb>Model: C250<lb>Year: 2014<lb>Miles: 7k<lb>Price: $20k<lb> <lb>Model: E350<lb>Year: 2010<lb>Miles: 59k<lb>Price: $19k<lb> <lb>Model: ML350<lb>Year: 2006<lb>Miles: 42k<lb>Price: $13k<lb> <lb>Thoughts? | The lease on my _NUMBER_ C250 is ending and I have the option to buy it for $20k. So I decided to see what else is out there for about the same amount of money and so far this are my options: Model: C250 Year: _NUMBER_ Miles: 7k Price: $20k Model: E350 Year: _NUMBER_ Miles: 59k Price: $19k Model: ML350 Year: _NUMBER_ Miles: 42k Price: $13k Thoughts? |
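The `impurity` helper imported from `source.library.text_analysis` is not shown here. A minimal sketch of one way to compute such a score, assuming it is the share of "suspicious" characters (markup and bracket characters) in the raw text, in the spirit of the Blueprints book. The name `impurity_score`, the character class, and the `min_len` threshold are assumptions for illustration.

```python
import re

# Characters that rarely occur in clean prose; this particular set is an assumption.
RE_SUSPICIOUS = re.compile(r"[&#<>{}\[\]\\]")


def impurity_score(text, min_len=10):
    """Return the share of suspicious characters in text (0.0 for very short texts)."""
    if text is None or len(text) < min_len:
        return 0.0
    return len(RE_SUSPICIOUS.findall(text)) / len(text)


# Fragment of one of the high-impurity posts shown above.
sample = "E-price:<lb>$20,863.00<lb>Freight:<lb>$900.00"
print(round(impurity_score(sample), 2))
```

Under this definition, posts that are mostly `<lb>` and `<tab>` markers with short price lines score highest, which matches the kind of posts listed in the table.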
df['language'].value_counts(ascending=False)
English    4996
Name: language, dtype: int64
df['subreddit'].value_counts(ascending=False)
Lexus                 266
Hyundai               263
Trucks                262
Honda                 261
MPSelectMiniOwners    260
mercedes_benz         259
mazda3                257
Harley                255
volt                  252
Volkswagen            252
Audi                  252
teslamotors           250
Volvo                 249
Mustang               248
BMW                   239
saab                  239
4Runner               238
Porsche               236
subaru                233
Wrangler              229
Name: subreddit, dtype: int64
Explore idiosyncrasies of various columns, e.g. same speaker represented multiple ways.
remove_tokens = {'_number_', 'car'}
count_tokens(df['partial_lemmas'], remove_tokens=remove_tokens).head(10)
| token | frequency |
|---|---|
| look | 2776 |
| like | 2355 |
| drive | 1880 |
| know | 1812 |
| new | 1738 |
| want | 1687 |
| buy | 1556 |
| thank | 1497 |
| work | 1467 |
| think | 1459 |
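`count_tokens` comes from the project library and its implementation is not shown. A minimal sketch of the same idea using only the standard library, assuming each document is already a list of tokens. The function name `count_token_frequency` and the toy documents are hypothetical.

```python
from collections import Counter


def count_token_frequency(documents, remove_tokens=None, min_frequency=1):
    """Count tokens across tokenized documents, most frequent first."""
    remove_tokens = remove_tokens or set()
    counter = Counter()
    for tokens in documents:
        # Skip tokens that carry no information for this analysis.
        counter.update(t for t in tokens if t not in remove_tokens)
    return [(tok, n) for tok, n in counter.most_common() if n >= min_frequency]


docs = [
    ["car", "look", "new", "look"],
    ["_number_", "mile", "look", "drive"],
    ["drive", "new", "car"],
]
top = count_token_frequency(docs, remove_tokens={"_number_", "car"})
print(top[:3])
```

Removing `_number_` and `car` before counting mirrors the `remove_tokens` set used above: both are so common in this corpus that they would crowd out more informative tokens.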
ax = df['post_length'].plot(kind='box', vert=False, figsize=(10, 1))
ax.set_title("Distribution of Post Length")
ax.set_xlabel("# of Characters")
ax.set_yticklabels([])
ax;
ax = df['post_length'].plot(kind='hist', bins=60, figsize=(10, 2));
ax.set_title("Distribution of Post Length")
ax.set_xlabel("# of Characters")
ax;
import seaborn as sns
sns.displot(df['post_length'], bins=60, kde=True, height=3, aspect=3);
where = df['subreddit'].isin([
'Lexus',
'mercedes_benz',
'Audi',
'Volvo',
'BMW',
])
g = sns.catplot(data=df[where], x="subreddit", y="post_length", kind='box')
g.fig.set_size_inches(6, 3)
g.fig.set_dpi(100)
g = sns.catplot(data=df[where], x="subreddit", y="post_length", kind='violin')
g.fig.set_size_inches(6, 3)
g.fig.set_dpi(100)
counts_df = count_tokens(df['partial_lemmas'], remove_tokens=remove_tokens)
def plot_wordcloud(frequency_dict):
    wc = wordcloud.WordCloud(
        background_color='white',
        #colormap='RdYlGn',
        colormap='tab20b',
        width=round(hlpp.STANDARD_WIDTH*100),
        height=round(hlpp.STANDARD_HEIGHT*100),
        max_words=200, max_font_size=150,
        random_state=42
    )
    wc.generate_from_frequencies(frequency_dict)
    fig, ax = plt.subplots(figsize=(hlpp.STANDARD_WIDTH, hlpp.STANDARD_HEIGHT))
    ax.imshow(wc, interpolation='bilinear')
    #plt.title("XXX")
    plt.axis('off')
plot_wordcloud(counts_df.to_dict()['frequency']);
tf_idf_lemmas = tf_idf(
df=df,
tokens_column='partial_lemmas',
segment_columns = None,
min_frequency_corpus=20,
min_frequency_document=20,
remove_tokens=remove_tokens,
)
tf_idf_lemmas.head()
| token | frequency | tf-idf |
|---|---|---|
| look | 2776 | 3043.65 |
| drive | 1880 | 2867.85 |
| like | 2355 | 2830.75 |
| new | 1738 | 2577.63 |
| mile | 1406 | 2510.05 |
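The `tf_idf` helper is imported from `source.library.text_analysis` and its implementation is not shown. A minimal sketch of one common weighting, assuming the score is the corpus-wide token frequency multiplied by `log(N / df) + 0.1`, where `N` is the number of documents and `df` the number of documents containing the token (the variant used in the Blueprints book). The function name `tf_idf_scores`, the `min_df` parameter, and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_scores(documents, min_df=1):
    """Score tokens as corpus frequency times (log(N / document frequency) + 0.1)."""
    n_docs = len(documents)
    term_freq = Counter(t for doc in documents for t in doc)
    # Each token counts once per document for the document frequency.
    doc_freq = Counter(t for doc in documents for t in set(doc))
    scores = {
        tok: term_freq[tok] * (math.log(n_docs / df) + 0.1)
        for tok, df in doc_freq.items()
        if df >= min_df
    }
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))


docs = [
    ["look", "oil", "change"],
    ["look", "new", "tire"],
    ["look", "oil", "oil", "leak"],
]
scores = tf_idf_scores(docs)
print(list(scores.items())[:3])
```

In the toy corpus, `look` appears in every document, so its inverse document frequency collapses to the 0.1 floor and `oil` outranks it despite the equal raw frequency. The same effect explains why the ordering by TF-IDF above differs from the ordering by raw frequency.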
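# Note: tokens in `bi_grams` are joined with '-', so these space-separated entries
# do not match them (e.g. `_number_-year` still appears in the results).
remove_tokens_bi_grams = {'_number_ year', '_number_ _number_', 'hey guy'}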
tf_idf_bi_grams = tf_idf(
df=df,
tokens_column='bi_grams',
segment_columns = None,
min_frequency_corpus=20,
min_frequency_document=20,
remove_tokens=remove_tokens_bi_grams,
)
tf_idf_bi_grams.head()
| token | frequency | tf-idf |
|---|---|---|
| $-_number_ | 1089 | 2366.46 |
| _number_-mile | 553 | 1417.20 |
| _number_-year | 397 | 1123.63 |
| _number_-_number_ | 285 | 890.38 |
| look-like | 229 | 749.94 |
tf_idf_nouns = tf_idf(
df=df,
tokens_column='nouns',
segment_columns = None,
min_frequency_corpus=20,
min_frequency_document=20,
remove_tokens=remove_tokens,
)
tf_idf_nouns.head()
| token | frequency | tf-idf |
|---|---|---|
| mile | 1404 | 2507.99 |
| issue | 1142 | 2240.32 |
| year | 1227 | 2226.76 |
| time | 1289 | 2216.42 |
| engine | 944 | 2100.66 |
tf_idf_noun_phrases = tf_idf(
df=df,
tokens_column='noun_phrases',
segment_columns = None,
min_frequency_corpus=20,
min_frequency_document=20,
remove_tokens=remove_tokens_bi_grams,
)
tf_idf_noun_phrases.head()
| token | frequency | tf-idf |
|---|---|---|
| oil-change | 132 | 514.63 |
| new-car | 130 | 502.27 |
| test-drive | 110 | 442.43 |
| engine-light | 101 | 439.81 |
| year-old | 105 | 426.65 |
ax = tf_idf_lemmas.head(30)[['tf-idf']].plot(kind='barh', width=0.99)
ax.set_title("TF-IDF of Uni-Grams")
ax.set_xlabel("TF-IDF")
ax.invert_yaxis();
ax = tf_idf_bi_grams.head(30)[['tf-idf']].plot(kind='barh', width=0.99)
ax.set_title("TF-IDF of Bi-Grams")
ax.set_xlabel("TF-IDF")
ax.invert_yaxis();
ax = tf_idf_nouns.head(30)[['tf-idf']].plot(kind='barh', width=0.99)
ax.set_title("TF-IDF of Nouns")
ax.set_xlabel("TF-IDF")
ax.invert_yaxis();
ax = tf_idf_noun_phrases.head(30)[['tf-idf']].plot(kind='barh', width=0.99)
ax.set_title("TF-IDF of Noun Phrases")
ax.set_xlabel("TF-IDF")
ax.invert_yaxis();
plot_wordcloud(tf_idf_lemmas.to_dict()['tf-idf']);
plot_wordcloud(tf_idf_bi_grams.to_dict()['tf-idf']);
remove_tokens_subreddit = set(df.subreddit.str.lower().unique())
remove_tokens_subreddit
{'4runner',
'audi',
'bmw',
'harley',
'honda',
'hyundai',
'lexus',
'mazda3',
'mercedes_benz',
'mpselectminiowners',
'mustang',
'porsche',
'saab',
'subaru',
'teslamotors',
'trucks',
'volkswagen',
'volt',
'volvo',
'wrangler'}
tf_idf_lemmas_per_sub = tf_idf(
df=df,
tokens_column='partial_lemmas',
segment_columns = 'subreddit',
min_frequency_corpus=10,
min_frequency_document=10,
remove_tokens=remove_tokens | remove_tokens_subreddit
)
tf_idf_lemmas_per_sub.head(5)
| subreddit | token | frequency | tf-idf |
|---|---|---|---|
| 4Runner | gen | 74 | 283.40 |
| | sr5 | 50 | 241.65 |
| | lift | 57 | 215.97 |
| | rear | 61 | 171.54 |
| | look | 148 | 162.27 |
tf_idf_bigrams_per_sub = tf_idf(
df=df,
tokens_column='bi_grams',
segment_columns = 'subreddit',
min_frequency_corpus=10,
min_frequency_document=10,
remove_tokens=remove_tokens_bi_grams
)
tf_idf_bigrams_per_sub.head(5)
| subreddit | token | frequency | tf-idf |
|---|---|---|---|
| 4Runner | _number_-4runner | 41 | 201.05 |
| | 3rd-gen | 27 | 138.26 |
| | $-_number_ | 60 | 130.38 |
| | _number_-sr5 | 23 | 129.29 |
| | 4th-gen | 15 | 90.78 |
tf_idf_nouns_per_sub = tf_idf(
df=df,
tokens_column='nouns',
segment_columns = 'subreddit',
min_frequency_corpus=10,
min_frequency_document=10,
remove_tokens=remove_tokens | remove_tokens_subreddit
)
tf_idf_nouns_per_sub.head(5)
| subreddit | token | frequency | tf-idf |
|---|---|---|---|
| 4Runner | gen | 74 | 283.40 |
| | sr5 | 50 | 241.65 |
| | lift | 42 | 178.41 |
| | mile | 82 | 146.48 |
| | trd | 26 | 146.16 |
tf_idf_nounphrases_per_sub = tf_idf(
df=df,
tokens_column='noun_phrases',
segment_columns = 'subreddit',
min_frequency_corpus=10,
min_frequency_document=10,
remove_tokens=remove_tokens_bi_grams
)
tf_idf_nounphrases_per_sub.head(5)
| subreddit | token | frequency | tf-idf |
|---|---|---|---|
| 4Runner | sway-bar | 12 | 65.78 |
| | check-engine | 10 | 44.13 |
| Harley | new-bike | 14 | 84.73 |
| | spark-plug | 10 | 46.85 |
| Honda | oil-change | 14 | 54.58 |
tokens_to_show = tf_idf_lemmas_per_sub.query("subreddit in ['Lexus', 'Volvo']").reset_index()
tokens_to_show.head()
| | subreddit | token | frequency | tf-idf |
|---|---|---|---|---|
| 0 | Lexus | is350 | 37 | 198.29 |
| 1 | Lexus | look | 166 | 182.01 |
| 2 | Lexus | mile | 101 | 180.31 |
| 3 | Lexus | gs | 30 | 160.77 |
| 4 | Lexus | drive | 103 | 157.12 |
px.bar(
tokens_to_show.groupby(['subreddit']).head(20).sort_values('tf-idf', ascending=True),
x='tf-idf',
y='token',
color='subreddit',
barmode='group',
title="Top 20 Lemmas for Volvo & Lexus"
)
tokens_to_show = tf_idf_bigrams_per_sub.query("subreddit in ['Lexus', 'Volvo']").reset_index()
tokens_to_show.head()
| | subreddit | token | frequency | tf-idf |
|---|---|---|---|---|
| 0 | Lexus | $-_number_ | 54 | 117.35 |
| 1 | Lexus | _number_-lexus | 19 | 105.00 |
| 2 | Lexus | f-sport | 16 | 94.55 |
| 3 | Lexus | _number_-is250 | 13 | 82.09 |
| 4 | Lexus | es-_number_ | 13 | 80.85 |
px.bar(
tokens_to_show.groupby(['subreddit']).head(20).sort_values('tf-idf', ascending=True),
x='tf-idf',
y='token',
color='subreddit',
barmode='group',
title="Top 20 Bi-Grams for Volvo & Lexus"
)
get_context_from_keyword(df.query("subreddit == 'Lexus'")['post'], keyword='think')
2045    ll me anything about it, what they |think| about it any problems they've had
2045    ems they've had etc. Also what you |think| of that price and mileage. I bough
3870    cheap and easy fix or do you guys |think| an insurance claim will end up nee
3952    ..<lb><lb>But what else? I am just |think| ing if they are about the same pric
3608    les and is currently at 169,7XX. I |think| the belts are pretty old but all i
3608    rk plugs are new. Idk what do yall |think| ?<lb><lb>Update: we put a code read
639     y forums such as Club Lexus, and I |think| it is an issue with the arms, beca
639     n the sunshade went up properly, I |think| it was because my coaxing got it p
639     han the other a little bit. Also I |think| may be catching in the slit that t
1061    what were your reasons? Overall I |think| both the cars are pretty evenly ma
dtype: object
get_context_from_keyword(df.query("subreddit == 'Volvo'")['post'], keyword='think')
909     /i.imgur.com/5aaq3tB.jpg)<lb><lb>I |think| it's not the first time it happene
4099    mes on. I try to stay positive by |think| ing maybe my mechanic didn't top me
14      ience are the any other issues you |think| I should address to avoid future p
14      already said, it runs great and I |think| it has a lot of miles left on it i
3360    need your help reddit, what do you |think| , how much do you pay for every 10k
1659    e $500 for the same.<lb><lb>Do you |think| it's worth saving the $170 and goi
3692    exhaust as best I can, and I don’t |think| that should come loose from drivin
2226    As the title asks, I'm |think| ing about buying a 2010 C30 R-Desig
4260    down to like 300. What do you guys |think| ?<lb><lb> I really miss my 850t wag
1456    e to be rebuilt, or am I just over- |think| ing it? I just don't seem to notice
dtype: object
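`get_context_from_keyword` is a project helper whose implementation is not shown. A minimal keyword-in-context sketch using only `re`, assuming a fixed character window on each side of every case-insensitive match. The function name `keyword_in_context`, the `window` parameter, and the sample sentence are hypothetical.

```python
import re


def keyword_in_context(text, keyword, window=35):
    """Return one 'left |keyword| right' string per case-insensitive match."""
    contexts = []
    for match in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
        # Take up to `window` characters on each side of the match.
        left = text[max(0, match.start() - window):match.start()].strip()
        right = text[match.end():match.end() + window].strip()
        contexts.append(f"{left} |{match.group()}| {right}")
    return contexts


text = "What do you guys think? I was thinking about an insurance claim."
for context in keyword_in_context(text, "think", window=12):
    print(context)
```

Because the match is on the substring rather than on whole words, `thinking` is reported as `|think| ing`, the same pattern visible in the outputs above. Searching for a stem such as `replac` relies on this behavior to catch `replace`, `replaced`, and `replacement` in one pass.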
tokens_to_show = tf_idf_lemmas_per_sub.query("subreddit == 'Lexus'").reset_index()
#tokens_to_show = tokens_to_show[~tokens_to_show.token.isin(stop_words)]
tokens_to_show = tokens_to_show[['token', 'tf-idf']].set_index('token')
tokens_to_show = tokens_to_show.to_dict()['tf-idf']
plot_wordcloud(tokens_to_show);
tokens_to_show = tf_idf_lemmas_per_sub.query("subreddit == 'Volvo'").reset_index()
#tokens_to_show = tokens_to_show[~tokens_to_show.token.isin(stop_words)]
tokens_to_show = tokens_to_show[['token', 'tf-idf']].set_index('token')
tokens_to_show = tokens_to_show.to_dict()['tf-idf']
plot_wordcloud(tokens_to_show);
contexts = get_context_from_keyword(
documents=df['post'],
window_width=50,
keyword='replac',
num_samples = 20,
random_seed=42
)
for x in contexts:
    print(x)
rger.<lb>* My Elantra still runs great and I just |replac| ed the tires.<lb>* The financials are all there, b
for a while, but now the time has come for me to |replac| e my coil springs. I had heard great things about
I'm |replac| ing the radio with a Pioneer AVH-X2800BS in my '04
<lb><lb>I sent it back to PA Performance and they |replac| ed all the internals. While I was waiting on PA Pe
Also, the battery in the car seems like it needs |replac| ing as the interior lights flicker and the car has
ere can confirm that before I go digging deep for |replac| ement panels. Thanks!
and my steering stabilizer is shot. Do I need to |replac| e it to keep my tires healthy or not? I really do
nyone knows the type of them? Or where can i find |replac| ements? I will add pictures, the small one is from
is having some problems so that might need to be |replac| ed. Could that explain the floatiness? <lb><lb>Als
<lb>The coolant reservoir tank seems to have been |replac| ed, I heard it's a common problem on these.
Anyone else experience this? Dealer quote to |replac| e it was insane for what I imagine is a relatively
I know the gasket needs to be |replac| ed and the mechanic i take mine too agreed to let
I also asked Mazda service dealers for a quote in |replac| ing this part. The quotes varied wildly with some
it not for corrosion around the old part.<lb><lb> |Replac| ing the vehicle speed sensor, or driven-speed gear
lb>My boss has a 2010 Dodge Ram 1500 and wants to |replac| e his old, corroded bars with some new ones. Thing
the things that I am seeing is pretty much about |replac| ing the entire intake manifold. Also seeing a lot
My fault, I fucked up |replac| ing the door lock actuator and now my rear passeng
may bite the bullet and buy a used door panel to |replac| e it with unless you guys have any suggestions.
gs, they were original, appear to have aged well. |Replac| ed the plugs, and switched the coil pack from 1-2
every now and then it simply won't start. 
I have |replac| ed and checked the battery, but I'm thinking I may